Abstract

In this paper we formulate in general terms an approach to proving
strong consistency of the Empirical Risk Minimisation inductive principle
applied to prototype-based (distance-based) clustering. This approach
was motivated by the Divisive Information-Theoretic Feature Clustering
model in probabilistic space with Kullback-Leibler divergence, which may
be regarded as a special case within the Clustering Minimisation framework.
We also propose a clustering regularization that restricts the creation of
additional clusters which are not significant or are not essentially
different from the existing clusters.



1 Introduction 

Clustering algorithms group data according to the given criteria. For example,
it may be a model based on Spectral Clustering [10] or a Prototype Based model.

In this paper we consider a Prototype Based approach which may be described
as follows. Initially, we have to choose $k$ prototypes. The corresponding
empirical clusters are defined according to the criterion of the nearest
prototype measured by the given distance, which generates the initial $k$
clusters (Clustering step). In the second, Minimisation step we recompute the
cluster centers, or $k$-means [3], using data strictly from the corresponding
clusters. Then we repeat the Clustering step using the new prototypes obtained
from the previous step as cluster centers. This algorithm has the descending
property and, consequently, reaches a local minimum in a finite number of steps.

Pollard [11] demonstrated that the classical $k$-means algorithm in
$\mathbb{R}^m$ with squared loss function satisfies the Key Theorem of
Learning Theory [M], p. 36: "the minimal empirical risk must converge to the
minimal actual risk".

A new clustering algorithm in probabilistic space was proposed in [5]. It
provides an attractive approach based on the Kullback-Leibler divergence. This
methodology requires a general formulation and framework, which we present
in Section 2.

Section 3 extends the methodology of [11] in order to cover the case of
$\mathcal{P}^m$ with Kullback-Leibler divergence. Using the results and
definitions of Section 3, we investigate relevant properties of
$\mathcal{P}^m$ in Section 4 and prove strong consistency of the Empirical
Risk Minimisation inductive principle.

Determination of the number of clusters $k$ represents an important problem.
For example, the G-means algorithm, which is based on a Gaussian fit of the
data within a particular cluster, has been proposed for this purpose. Usually,
attempts to estimate the number of Gaussian clusters lead to a very high value
of $k$ [15]. The simplest criteria, such as AIC (Akaike Information Criterion
[1]) and BIC (Bayesian Information Criterion [12], [5]), either overestimate or
underestimate the number of clusters, which severely limits their practical
usability. In Section 5 we introduce a special clustering regularization. This
regularization restricts the creation of a new cluster which is not big enough
and which is not sufficiently different from the existing clusters.

2 Prototype Based Approach 

In this paper we consider a sample of i.i.d. observations
$\mathbf{X} := \{x_1, \ldots, x_n\}$ drawn from the probability space
$(\mathcal{X}, \mathcal{A}, P)$, where the probability measure $P$ is assumed
to be unknown.

Key in this scenario is an encoding problem. Assuming that we have a
codebook $Q \in \mathcal{X}^k$ with prototypes $q(c)$ indexed by the code
$c = 1, \ldots, k$, the aim is to encode any $x \in \mathcal{X}$ by some
$q(c(x))$ such that the distortion between $x$ and $q(c(x))$ is minimized:

$$c(x) := \operatorname*{argmin}_{c} \mathcal{L}(x, q(c)) \tag{1}$$

where $\mathcal{L}(\cdot, \cdot)$ is a loss function.

Using criterion (1) we split the empirical data into $k$ clusters. As a next
step we compute the cluster center specifically for any particular cluster in
order to minimise the overall distortion error.

We estimate the actual distortion error

$$\Re^{(k)}[Q] := \mathbf{E}\, \mathcal{L}(x, Q) \tag{2}$$

by the empirical error

$$\Re^{(k)}_{\mathrm{emp}}[Q] := \frac{1}{n} \sum_{t=1}^{n} \mathcal{L}(x_t, Q) \tag{3}$$

where $\mathcal{L}(x, Q) := \mathcal{L}(x, q(c(x)))$.

The following theorem, which may be proved similarly to Theorems 4 and 5 of
[5], formulates the most important descending and convergence properties
within the Clustering Minimisation (CM) framework:

Theorem 1 The CM-algorithm includes 2 steps:

Clustering Step: recompute $c(x)$ according to (1) for fixed prototypes from
the given codebook $Q$, which will be updated as cluster centers in the next
step;

Minimisation Step: recompute the cluster centers for a fixed mapping $c(x)$,
that is, minimize the objective function (3) over $Q$.

The algorithm

1) monotonically decreases the value of the objective function (3);

2) converges to a local minimum in a finite number of steps if the
Minimisation Step has an exact solution.
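The two steps of Theorem 1 can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the names `sq_loss`, `mean`, `empirical_risk` and `cm_algorithm` are hypothetical, the squared Euclidean loss is assumed as one admissible choice of $\mathcal{L}$, and the arithmetic mean is used as the exact Minimisation Step for that loss. The recorded trace of the empirical error (3) should be non-increasing, in line with property 1).

```python
import random

def sq_loss(x, q):
    """Squared Euclidean loss (an assumed, illustrative choice of L(x, q))."""
    return sum((a - b) ** 2 for a, b in zip(x, q))

def mean(points):
    """Arithmetic mean of a list of vectors (exact minimiser for squared loss)."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def empirical_risk(data, codebook, loss):
    """Empirical error (3): average loss to the nearest prototype."""
    return sum(min(loss(x, q) for q in codebook) for x in data) / len(data)

def cm_algorithm(data, codebook, loss=sq_loss, max_iter=100):
    """Alternate Clustering and Minimisation steps; return codebook and risk trace."""
    trace = [empirical_risk(data, codebook, loss)]
    for _ in range(max_iter):
        # Clustering Step: c(x) = argmin_c L(x, q(c)), as in (1).
        clusters = [[] for _ in codebook]
        for x in data:
            c = min(range(len(codebook)), key=lambda i: loss(x, codebook[i]))
            clusters[c].append(x)
        # Minimisation Step: recompute centers; an empty cluster keeps its prototype.
        new_codebook = [mean(pts) if pts else q for pts, q in zip(clusters, codebook)]
        trace.append(empirical_risk(data, new_codebook, loss))
        if new_codebook == codebook:
            break  # fixed point reached: a local minimum of (3)
        codebook = new_codebook
    return codebook, trace

# Synthetic two-blob sample (hypothetical data, for illustration only).
random.seed(0)
data = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(50)]
data += [[random.gauss(5, 1), random.gauss(5, 1)] for _ in range(50)]
codebook, trace = cm_algorithm(data, [data[0], data[50]])
print(trace)
```

Passing a different `loss` together with a matching center update would give other members of the CM family; the loop itself does not depend on the particular loss.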

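We define an optimal actual codebook $\overline{Q}$ by the following condition:

$$\Re^{(k)}(\overline{Q}) := \inf_{Q \in \mathcal{X}^k} \Re^{(k)}(Q). \tag{4}$$

The following relations are valid:

$$\Re^{(k)}_{\mathrm{emp}}[Q_n] \le \Re^{(k)}_{\mathrm{emp}}[\overline{Q}]; \qquad \Re^{(k)}_{\mathrm{emp}}[\overline{Q}] \to \Re^{(k)}[\overline{Q}] \quad \text{a.s.} \tag{5}$$

where $Q_n$ is an optimal empirical codebook:

$$\Re^{(k)}_{\mathrm{emp}}(Q_n) := \inf_{Q \in \mathcal{X}^k} \Re^{(k)}_{\mathrm{emp}}(Q). \tag{6}$$

The main target is to demonstrate asymptotic (almost sure) convergence

$$\Re^{(k)}_{\mathrm{emp}}[Q_n] \to \Re^{(k)}[\overline{Q}] \quad \text{a.s.} \quad (n \to \infty). \tag{7}$$

In order to prove (7) we define in Section 3 a general model which has a direct
relation to the model in probabilistic space with KL divergence [5].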

The proof of the main result, which is formulated in Theorem 2, includes
two steps:

(1) by Lemma 1 we prove the existence of $n_0$ such that $Q_n \subset \Gamma$
for all $n \ge n_0$, where the subset $\Gamma \subset \mathcal{X}$ satisfies
the condition $\mathcal{L}(x, q) < \infty$ for all
$x \in \mathcal{X}, q \in \Gamma$; and

(2) by Lemma 2 we prove (under some additional constraints of general nature)

$$\sup_{Q \in \Gamma^k} \left| \Re^{(k)}_{\mathrm{emp}}[Q] - \Re^{(k)}[Q] \right| \to 0 \quad \text{a.s.} \tag{8}$$
3 General Theory and Definitions

In this section we employ some ideas and methods proposed in [11], which
cover the case of $\mathbb{R}^m$ with loss function
$\mathcal{L}(x, q) := \varphi(\|x - q\|)$ where $\varphi$ is a strictly
increasing function.

Let us assume that the following structural representation with $P$-integrable
vector-functions $\xi$ and $\eta$ is valid:

$$\mathcal{L}(x, q) = \sum_{i=0}^{m} \xi_i(x) \cdot \eta_i(q) \ge 0 \quad \forall x, q \in \mathcal{X}. \tag{9}$$

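Let us define subsets of $\mathcal{X}$ as extensions of the empirical clusters:

$$\mathcal{X}_c(Q) := \{x \in \mathcal{X} : c = \operatorname*{argmin}_{i} \mathcal{L}(x, q(i))\}.$$

Then we can re-write (2) as follows:

$$\Re^{(k)}[Q] := \sum_{c=1}^{k} \langle \xi(\mathcal{X}_c), \eta(q(c)) \rangle \tag{10}$$

where $\xi(A) = \int_A \xi(x) P(dx)$, $A \in \mathcal{A}$.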

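We define a ball with radius $r$ and a corresponding remainder in $\mathcal{X}$:

$$B(r) = \{q \in \mathcal{X} : \mathcal{L}(x, q) \le r, \ \forall x \in \mathcal{X}\}, \tag{11a}$$
$$T(r) = \mathcal{X} \setminus B(r), \quad r \ge r_0, \tag{11b}$$
$$r_0 = \inf\{r \ge 0 : B(r) \ne \emptyset\}. \tag{11c}$$

The following properties are valid:

$$\langle \xi(A_1) - \xi(A_2), \eta(q) \rangle \ge 0 \tag{12}$$

for all $q \in \mathcal{X}$ and any $A_1, A_2 \in \mathcal{A} : A_2 \subset A_1$;

$$\langle \xi(\mathcal{X}), \eta(q) \rangle \le r \quad \forall q \in B(r). \tag{13}$$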

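Suppose that

$$\|\xi(T(U))\| \to 0 \quad \text{as } U \to \infty. \tag{14}$$

The following distances will be used below:

$$\rho(A_1, A_2) := \inf_{a_1 \in A_1} \inf_{a_2 \in A_2} \mathcal{L}(a_1, a_2), \quad A_1, A_2 \in \mathcal{A}; \tag{15}$$

$$\mu(A_1, A_2) := \inf_{a_1 \in A_1} \sup_{a_2 \in A_2} \mathcal{L}(a_1, a_2), \quad A_1, A_2 \in \mathcal{A}. \tag{16}$$

Suppose that

$$\rho(B(r), T(U)) \to \infty \quad \text{as } U \to \infty \tag{17}$$

for any fixed $r_0 \le r < \infty$.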

Remark 1 We assume that

$$T(U) \ne \emptyset \tag{18}$$

for any fixed $U : r_0 \le U < \infty$; otherwise, Lemma 1 below becomes
trivial.

Lemma 1 Suppose that the structure of the loss function $\mathcal{L}$ is
defined in (9) under condition (17), the probability distribution $P$
satisfies condition (14), and the number of clusters $k \ge 1$ is fixed. Then
we can select a large enough radius $Z : 0 < Z < \infty$ and $n_0 \ge 1$ such
that all components of the optimal empirical codebook $Q_n$ defined in (6)
will be within the ball $B(Z)$: $Q_n \subset B(Z)$ if the sample size is large
enough: $\forall n \ge n_0$.

Proof: Existence of the element $a \in \mathcal{X}$ such that

$$D_a = \Re^{(1)}(\{a\}) = \langle \xi(\mathcal{X}), \eta(a) \rangle < \infty \tag{19}$$

follows from (13) and (11c).

Suppose that

$$P(B(r)) = P_0 > 0, \quad r \ge r_0. \tag{20}$$

We can construct $B(V)$ in accordance with condition (17) and (20):

$$V = \inf\left\{ v > r : \rho(B(r), T(v)) \ge \frac{D_a + \varepsilon}{P_0} \right\}, \quad \varepsilon > 0. \tag{21}$$

Suppose there are no empirical prototypes within $B(V)$. Then, in accordance
with definition (21),

$$\Re^{(k)}_{\mathrm{emp}}[Q_n] \ge D_a + \varepsilon > D_a \quad \forall n \ge n_0.$$

This contradicts (19) and (5). Therefore, at least one prototype from $Q_n$
must be within $B(V)$ if $n$ is large enough (this fact is valid for
$\overline{Q}$ as well). Without loss of generality we assume that

$$q(1) \in B(V). \tag{22}$$

This completes the proof of the Lemma in the case $k = 1$. Following the
method of mathematical induction, suppose that $k \ge 2$ and

$$\Re^{(k-1)}(\overline{Q}) - \Re^{(k)}(\overline{Q}) \ge \varepsilon > 0. \tag{23}$$

Then we define a ball $B(U)$ by the following condition:

$$U = \inf\left\{ u \ge V : \sup_{q \in B(V)} \langle \xi(T(u)), \eta(q) \rangle < \varepsilon \right\}. \tag{24}$$

Existence of $U : V \le U < \infty$ in (24) follows from (14).

By definition of the distance $\mu$ and the ball $B(V)$,

$$\mathcal{V}(U, V) := \mu(T(U), B(V)) \le V < \infty. \tag{25}$$

Now we can define the remainder $T(Z)$ in accordance with condition (17):

$$Z = \inf\{ z \ge U : \rho(B(U), T(z)) > \mathcal{V}(U, V) \}. \tag{26}$$

Suppose that there is at least one prototype within $T(Z)$, for example,
$q(2) \in T(Z)$. On the other hand, we know (22). Let us consider what happens
if we remove $q(2)$ from the optimal empirical codebook $Q_n$ (the case of the
optimal actual codebook $\overline{Q}$ may be considered similarly) and
replace it by $q(1)$:

(1) as a consequence of (25) and (26), all empirical data within $B(U)$ are
closer to $q(1)$ anyway, which means that the data from $B(U)$ will not
increase the empirical (or actual) risk (3);

(2) by definition, $\mathcal{X} = B(U) \cup T(U)$,
$B(U) \cap T(U) = \emptyset$, and in accordance with condition (24) the
increase of the empirical risk due to the data within $T(U)$ must be strictly
less than $\varepsilon$ for all large enough $n \ge n_0$ (the increase of the
actual risk will be strictly less than $\varepsilon$ for all $n \ge 1$).

This contradicts condition (23) and (5). Therefore, all prototypes from
$\overline{Q}$ must be within $\Gamma = B(Z)$ for all $n \ge 1$, and
$Q_n \subset \Gamma$ if $n$ is large enough. ■

3.1 Uniform Strong Law of Large Numbers (SLLN) 

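Let $\mathcal{F}$ denote the family of $P$-integrable functions on $\mathcal{X}$.

A sufficient condition for the uniform SLLN (8) is: for each $\delta > 0$
there exists a finite class $\mathcal{F}_\delta \subset \mathcal{F}$ such that
to each $\mathcal{L} \in \mathcal{F}$ there are functions
$\underline{\mathcal{L}}$ and $\overline{\mathcal{L}} \in \mathcal{F}_\delta$
with the following 2 properties:

$$\underline{\mathcal{L}}(x) \le \mathcal{L}(x) \le \overline{\mathcal{L}}(x) \ \text{ for all } x \in \mathcal{X}; \qquad \int \left( \overline{\mathcal{L}}(x) - \underline{\mathcal{L}}(x) \right) P(dx) \le \delta.$$

We shall assume here the existence of a function $\psi$ such that

$$\|\eta(q)\| \le \psi(Z) < \infty \tag{27}$$

for all $q \in B(Z)$ where $r_0 \le Z < \infty$.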

Lemma 2 Suppose that the number of clusters $k$ is fixed and the loss function
$\mathcal{L}$ is defined by (9) under conditions (27) and

$$\|\xi(x)\| \le R < \infty \quad \forall x \in \mathcal{X}. \tag{28}$$

Then the asymptotic relation (8) is valid for any $\Gamma = B(Z)$,
$r_0 \le Z < \infty$.

Proof: Let us consider the definition of the Hausdorff metric $\mathcal{H}$ in
$\mathbb{R}^{m+1}$:

$$\mathcal{H}(A_1, A_2) = \sup_{a_1 \in A_1} \inf_{a_2 \in A_2} \|a_1 - a_2\|,$$

and denote by $G$ the subset in $\mathbb{R}^{m+1}$ which is obtained from
$\Gamma$ as a result of the $\eta$-transformation. According to condition
(27), $G$ represents a compact set. This means the existence of a finite
subset $G_\delta$ for any $\delta > 0$ such that
$\mathcal{H}(G, G_\delta) \le \frac{\delta}{2R}$, where $R$ is defined in (28).
We denote by $\Gamma_\delta \subset \Gamma$ the subset which corresponds to
$G_\delta \subset G$ according to the $\eta$-transformation. Respectively, we
can define a transformation $f_\delta$ from $\Gamma$ to $\Gamma_\delta$
(according to the principle of the nearest point), and $Q_\delta = f_\delta(Q)$,
where closeness may be tested independently for any particular component of
$Q$, which means absolute closeness.

In accordance with the Cauchy-Schwarz inequality, the following relations
take place:

$$\underline{\mathcal{L}} = \mathcal{L}(x, Q_\delta) - \frac{\delta}{2} \le \mathcal{L}(x, Q) \le \mathcal{L}(x, Q_\delta) + \frac{\delta}{2} = \overline{\mathcal{L}} \quad \forall x \in \mathcal{X}.$$

Finally, $\int \left( \overline{\mathcal{L}}(x, Q_\delta) - \underline{\mathcal{L}}(x, Q_\delta) \right) P(dx) \le \delta$,
where $Q_\delta \in \Gamma_\delta^k$ is the absolutely closest codebook for the
arbitrary $Q \in \Gamma^k$. ■



4 A Probabilistic Framework 

Following [5], we assume that the probabilities
$p_{\ell t} = P(\ell \mid x_t)$, $\sum_{\ell=1}^{m} p_{\ell t} = 1$,
$t = 1, \ldots, n$, represent relations between observations $x_t$ and
attributes or classes $\ell = 1, \ldots, m$, $m \ge 2$.

Accordingly, we define the probabilistic space $\mathcal{P}^m$ of all
$m$-dimensional probability vectors with Kullback-Leibler (KL) divergence:

$$KL(v, u) := \sum_{\ell} v_\ell \cdot \log \frac{v_\ell}{u_\ell} = \left\langle v, \log \frac{v}{u} \right\rangle, \quad v, u \in \mathcal{P}^m.$$
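As a minimal sketch of this definition (the function name `kl` and the handling of zero components are assumptions made here for illustration), the divergence can be evaluated directly, using the usual convention $0 \cdot \log 0 = 0$ and returning infinity when $v$ puts mass on a component where $u$ is zero.

```python
import math

def kl(v, u):
    """KL(v, u) = sum_l v_l * log(v_l / u_l), with the convention 0 * log 0 = 0."""
    total = 0.0
    for vl, ul in zip(v, u):
        if vl > 0.0:
            if ul == 0.0:
                return math.inf  # v has mass where u has none
            total += vl * math.log(vl / ul)
    return total

same = kl([0.5, 0.5], [0.5, 0.5])
vertex = kl([1.0, 0.0], [0.5, 0.5])
margin = kl([0.5, 0.5], [1.0, 0.0])
print(same, vertex, margin)
```

Note that the divergence is not symmetric: swapping the arguments in the last two calls gives different values.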

Graphical Example. Figure 1(a) illustrates the first two coordinates of the
synthetic data in $\mathcal{P}^3$. The third coordinate is not necessary
because it is a function of the first two coordinates.

Remark 2 As was demonstrated in [5], cluster centers $q_c$ in the space
$\mathcal{P}^m$ with KL-divergence must be computed using $k$-means:

$$q_c = \frac{1}{n_c} \sum_{x_t \in \mathbf{X}_c} p_t \tag{29}$$

where $c(x_t) = c$ if $x_t \in \mathbf{X}_c$ and $n_c = \#\mathbf{X}_c$ is the
number of observations in the cluster $\mathbf{X}_c$, $c = 1, \ldots, k$,
$p_t = \{p_{1t}, \ldots, p_{mt}\}$, $q_c = \{q_{1c}, \ldots, q_{mc}\}$.
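The role of the arithmetic mean in (29) can be checked numerically. The sketch below is an illustration under stated assumptions, not part of the original analysis: the helper names and the random test data are hypothetical. It relies on the elementary identity $\frac{1}{n}\sum_t KL(p_t, q) - \frac{1}{n}\sum_t KL(p_t, \bar{p}) = KL(\bar{p}, q) \ge 0$, where $\bar{p}$ is the mean of the $p_t$, so no candidate center $q$ can beat the mean.

```python
import math
import random

def kl(v, u):
    """KL(v, u) for strictly positive u (terms with v_l = 0 are skipped)."""
    return sum(a * math.log(a / b) for a, b in zip(v, u) if a > 0.0)

def random_prob_vector(m, rng):
    """Random strictly positive probability vector (illustrative data only)."""
    w = [rng.random() + 1e-3 for _ in range(m)]
    s = sum(w)
    return [x / s for x in w]

def kl_center(ps):
    """Cluster center (29): component-wise arithmetic mean."""
    n = len(ps)
    return [sum(col) / n for col in zip(*ps)]

def avg_kl(ps, q):
    """Average divergence of the cluster members from a candidate center q."""
    return sum(kl(p, q) for p in ps) / len(ps)

rng = random.Random(0)
cluster = [random_prob_vector(3, rng) for _ in range(200)]
center = kl_center(cluster)
candidates = [random_prob_vector(3, rng) for _ in range(100)]

# Excess distortion of each candidate over the mean, and the identity check.
gaps = [avg_kl(cluster, q) - avg_kl(cluster, center) for q in candidates]
identity_errors = [abs(g - kl(center, q)) for g, q in zip(gaps, candidates)]
print(min(gaps), max(identity_errors))
```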

In contrast to the model of [11] in $\mathbb{R}^m$, the structure (9) covers an
important case of $\mathcal{P}^m$ with KL-divergence:

$$\xi_0(v) = \sum_{\ell=1}^{m} v_\ell \log v_\ell; \quad \xi_\ell(v) = v_\ell; \tag{30}$$
$$\eta_0(u) = 1; \quad \eta_\ell(u) = -\log u_\ell, \quad \ell = 1, \ldots, m.$$

Figure 1: (a-b) 3D probability data with 4 and 6 cluster centers, $n = 8000$;
(c) convergence of the CM algorithm based on KL divergence in the case of
6 clusters; (d) behavior of the empirical error (3) (blue, dashed) and empirical
error with cost term (35) where $\alpha = 0.1$, $\beta = 0.03$.



Definition. We will call an element $v \in \mathcal{P}^m$ 1) a uniform center
if $v_\ell = \frac{1}{m}$, $\ell = 1, \ldots, m$; 2) an absolute margin if
$\min_\ell v_\ell = 0$.

Proposition 1 The ball $B(Z) \subset \mathcal{P}^m$ contains only one element,
named the uniform center, in the case $Z = r_0 = \log(m)$, and
$B(Z) = \emptyset$ if $Z < r_0$.

Proof: Suppose that $u$ is a uniform center. Then
$KL(v, u) = \sum_{\ell=1}^{m} v_\ell \log v_\ell + \log m \le \log m$ for all
$v \in \mathcal{P}^m$. In any other case, one of the components of $u$ must be
less than $\frac{1}{m}$. Respectively, we can select the corresponding
component of the probability vector $v$ as 1. Therefore,
$KL(v, u) > \log(m)$ and $r_0 = \log(m)$. ■



Lemma 3 The KL divergence in probabilistic space $\mathcal{P}^m$ always
satisfies condition (28), where the vector-function $\xi$ is expressed by (30),
with the following upper bounds:

$$|\xi_0(v)| \le \log(m); \quad |\xi_i(v)| \le 1, \ i = 1, \ldots, m, \quad \forall v \in \mathcal{P}^m.$$

Lemma 4 The following relations are valid in $\mathcal{P}^m$:

(1) $\min_\ell \{u_\ell\} < e^{-r}$ for all $u \in T(r)$, $\forall r \ge r_0$;

(2) $u_\ell \ge e^{-r}$ for all $\ell = 1, \ldots, m$, and any $u \in B(r)$,
$\forall r \ge r_0$.

Proof: As far as $\mathcal{P}^m = B(r) \cup T(r)$,
$B(r) \cap T(r) = \emptyset$, the first statement may be regarded as a
consequence of the second. Suppose that $u \in B(r)$ and
$u_1 = e^{-r-\varepsilon}$, $\varepsilon > 0$. Then we can select $v_1 = 1$,
and $KL(v, u) = r + \varepsilon > r$, which is a contradiction. ■
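The arguments in the proofs of Proposition 1 and Lemma 4 both pick a vertex $v$ with a single component equal to 1. A small numerical sketch of this idea follows; the helper names are hypothetical, and the closed form $\sup_v KL(v, u) = -\log \min_\ell u_\ell$ is an assumption stated here, justified by convexity of $KL(v, u)$ in $v$ (the supremum over the simplex is attained at a vertex). Under this reading, $u \in B(r)$ exactly when $\min_\ell u_\ell \ge e^{-r}$.

```python
import math
import random

def kl(v, u):
    """KL(v, u) for strictly positive u (terms with v_l = 0 are skipped)."""
    return sum(a * math.log(a / b) for a, b in zip(v, u) if a > 0.0)

def worst_case_kl(u):
    """sup over v of KL(v, u): reached at the vertex of the smallest u_l."""
    smallest = min(u)
    return math.inf if smallest <= 0.0 else -math.log(smallest)

def in_ball(u, r):
    """Membership test for B(r) in the sense of (11a) with KL loss."""
    return worst_case_kl(u) <= r

m = 3
uniform = [1.0 / m] * m
skewed = [0.2, 0.3, 0.5]
r0 = worst_case_kl(uniform)          # should equal log(m)
skewed_radius = worst_case_kl(skewed)

# Random probability vectors never exceed the vertex value.
rng = random.Random(0)
samples = []
for _ in range(1000):
    w = [rng.random() + 1e-6 for _ in range(m)]
    s = sum(w)
    samples.append([x / s for x in w])
largest_sampled = max(kl(v, skewed) for v in samples)
print(r0, skewed_radius, largest_sampled)
```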

Corollary 1 The KL divergence in $\mathcal{P}^m$ always satisfies condition
(17) and

$$-\log(m) + Z \cdot e^{-r} \le \rho(B(r), T(Z)) \le e^{-r} \cdot (Z - r) + (1 - e^{-r}) \log \frac{1 - e^{-r}}{1 - e^{-Z}}$$

for all $r_0 \le r < Z$, where the distance $\rho$ is defined in (15).

Proof: Suppose that $v \in B(r)$ and $u \in T(Z)$. Then
$-\sum_{\ell=1}^{m} v_\ell \log(u_\ell) \ge Z \cdot e^{-r}$ for all
$r : r_0 \le r < Z$. On the other hand, the entropy
$H(v) = -\sum_{\ell=1}^{m} v_\ell \log(v_\ell)$ cannot exceed $\log(m)$. The
lower bound is proved. In order to prove the upper bound we suppose, without
loss of generality, that $v_1 = e^{-r}$, $u_1 = e^{-Z}$, and all other
components are proportional. ■

Theorem 2 Suppose that the probability measure $P$ satisfies condition (14) in
probabilistic space $\mathcal{P}^m$ with KL divergence and the number of
clusters $k$ is fixed. Then the minimal empirical error (3) will converge to
the minimal actual error (2) with probability 1, or a.s.

Proof: Follows directly from Lemmas 1, 2, 3 and 4. ■

Remark 3 Condition (14) will not be valid if and only if the probability of the
subset of all absolute margins is strictly positive. Note that in order to
avoid any problems with consistency we can generalise the definition of
KL-divergence using a special smoothing parameter $0 < \theta < 1$:

$$KL_\theta(v, u) = KL(v_\theta, u_\theta)$$

where $v_\theta = \theta v + (1 - \theta) v_0$ and
$u_\theta = \theta u + (1 - \theta) v_0$, and $v_0$ is the uniform center.
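A short sketch of the smoothed divergence follows, with hypothetical helper names. It illustrates one consequence that is easy to verify: every component of $u_\theta$ is at least $(1 - \theta)/m$, so $KL_\theta$ stays finite and is bounded by $\log\frac{m}{1 - \theta}$ even when $u$ is an absolute margin. The value $\theta = 0.9$ is an arbitrary choice made for the example.

```python
import math

def kl(v, u):
    """KL(v, u), returning infinity when v has mass where u is zero."""
    total = 0.0
    for vl, ul in zip(v, u):
        if vl > 0.0:
            if ul == 0.0:
                return math.inf
            total += vl * math.log(vl / ul)
    return total

def smooth(v, theta):
    """v_theta = theta * v + (1 - theta) * v0, where v0 is the uniform center."""
    m = len(v)
    return [theta * x + (1.0 - theta) / m for x in v]

def kl_smoothed(v, u, theta):
    """KL_theta(v, u) = KL(v_theta, u_theta)."""
    return kl(smooth(v, theta), smooth(u, theta))

v = [1.0, 0.0, 0.0]
u = [0.0, 0.5, 0.5]                  # an absolute margin: min component is 0
theta = 0.9                          # illustrative value, not from the paper
plain = kl(v, u)                     # infinite without smoothing
smoothed = kl_smoothed(v, u, theta)  # finite with smoothing
bound = math.log(len(u) / (1.0 - theta))
print(plain, smoothed, bound)
```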



5 Clustering Regularization 

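Let us introduce the following definitions:

$$q_c := \frac{1}{n_c} \sum_{x_t \in \mathbf{X}_c} p_t; \qquad q := \frac{1}{n} \sum_{x_t \in \mathbf{X}} p_t = \sum_{c=1}^{k} p_c \cdot q_c;$$

$$H(q_c) := -\langle q_c, \log q_c \rangle; \qquad H(\mathbf{X}) := \frac{1}{n} \sum_{t=1}^{n} H(x_t)$$

where $p_c = P(\mathbf{X}_c) = \frac{n_c}{n}$, $c = 1, \ldots, k$, and
$H(x_t) = -\langle p_t, \log p_t \rangle$.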

We define in this section a regularisation to restrict the usage of unnecessary
clusters. This regularisation is based on the following two conditions:

C1) $p_c \ge \alpha > 0$, $c = 1, \ldots, k$ (significance of any particular
cluster);

C2) $KLS(q_i, q_c) := KL(q_i, q_c) + KL(q_c, q_i) \ge \beta > 0$ (difference
between any 2 clusters $i$ and $c$, $i \ne c$).
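Conditions C1 and C2 can be checked mechanically for a given set of cluster weights and centers. The following sketch uses hypothetical names and made-up weights and centers; the thresholds $\alpha = 0.1$ and $\beta = 0.03$ are the values quoted in the caption of Figure 1. It reports which clusters are too small and which pairs are too similar.

```python
import math

def kl(v, u):
    """KL(v, u) for strictly positive u (terms with v_l = 0 are skipped)."""
    return sum(a * math.log(a / b) for a, b in zip(v, u) if a > 0.0)

def kls(a, b):
    """Symmetrised divergence KLS(a, b) = KL(a, b) + KL(b, a) from C2."""
    return kl(a, b) + kl(b, a)

def check_regularity(weights, centers, alpha, beta):
    """Return (small, close): clusters violating C1 and pairs violating C2."""
    small = [c for c, p in enumerate(weights) if p < alpha]
    close = [
        (i, c)
        for i in range(len(centers))
        for c in range(i + 1, len(centers))
        if kls(centers[i], centers[c]) < beta
    ]
    return small, close

# Hypothetical clustering: cluster 2 is tiny and nearly duplicates cluster 0.
weights = [0.5, 0.45, 0.05]
centers = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7], [0.68, 0.21, 0.11]]
small, close = check_regularity(weights, centers, alpha=0.1, beta=0.03)
print(small, close)
```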

According to [Bj , if more prototypes are used for the $k$-means clustering, the
algorithm splits clusters, which means that it represents a single cluster by
more than one prototype. The following Proposition 2 considers the clustering
procedure in the inverse direction.

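Proposition 2 The following representations are valid:

$$\Re^{(k)}_{\mathrm{emp}} = -H(\mathbf{X}) + \sum_{c=1}^{k} p_c H(q_c); \qquad \Re^{(1)}_{\mathrm{emp}} - \Re^{(k)}_{\mathrm{emp}} = \sum_{c=1}^{k} p_c KL(q_c, q), \quad \forall n \ge k \ge 1.$$

Proof: In accordance with the above definitions,

$$\Re^{(k)}_{\mathrm{emp}} = \frac{1}{n} \sum_{c=1}^{k} \sum_{x_t \in \mathbf{X}_c} \langle p_t, \log p_t - \log q_c \rangle = -H(\mathbf{X}) - \sum_{c=1}^{k} p_c \langle q_c, \log q_c \rangle$$

and

$$\sum_{c=1}^{k} p_c \langle q_c, \log q_c \rangle - \sum_{c=1}^{k} p_c \langle q_c, \log q \rangle = \sum_{c=1}^{k} p_c KL(q_c, q),$$

where the second equation follows directly from the first one. ■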
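Both representations in Proposition 2 can be verified numerically on synthetic data. The sketch below is a sanity check under explicit assumptions: the data are random probability vectors, the partition is arbitrary (labels assigned round-robin), the centers are the cluster means as in (29), and the empirical risk is evaluated for that fixed partition. All names are hypothetical.

```python
import math
import random

def kl(v, u):
    """KL(v, u) for strictly positive u (terms with v_l = 0 are skipped)."""
    return sum(a * math.log(a / b) for a, b in zip(v, u) if a > 0.0)

def entropy(p):
    """H(p) = -<p, log p>."""
    return -sum(x * math.log(x) for x in p if x > 0.0)

def mean(ps):
    """Component-wise arithmetic mean of probability vectors."""
    n = len(ps)
    return [sum(col) / n for col in zip(*ps)]

def random_prob_vector(m, rng):
    w = [rng.random() + 1e-3 for _ in range(m)]
    s = sum(w)
    return [x / s for x in w]

rng = random.Random(1)
k, n, m = 3, 300, 4
data = [random_prob_vector(m, rng) for _ in range(n)]
labels = [t % k for t in range(n)]            # arbitrary fixed partition
clusters = [[p for p, c in zip(data, labels) if c == j] for j in range(k)]
centers = [mean(pts) for pts in clusters]     # q_c
weights = [len(pts) / n for pts in clusters]  # p_c = n_c / n
q = mean(data)                                # overall center

risk_k = sum(kl(p, centers[c]) for p, c in zip(data, labels)) / n
risk_1 = sum(kl(p, q) for p in data) / n
h_x = sum(entropy(p) for p in data) / n

rhs_first = -h_x + sum(w * entropy(c) for w, c in zip(weights, centers))
rhs_second = sum(w * kl(c, q) for w, c in zip(weights, centers))
print(risk_k - rhs_first, (risk_1 - risk_k) - rhs_second)
```

Both printed differences should be zero up to floating-point rounding.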
Corollary 1 Assuming that we merge the first $\tau$ clusters,
$1 < \tau \le k$, the following relation is valid:

$$\Re^{(k-\tau+1)}_{\mathrm{emp}} - \Re^{(k)}_{\mathrm{emp}} \le \sum_{c=1}^{\tau} p_c KL(q_c, q_*), \qquad q_* = \frac{\sum_{c=1}^{\tau} p_c q_c}{\sum_{c=1}^{\tau} p_c}. \tag{31}$$

Remark 4 The first $\tau$ clusters were chosen in order to simplify the
notation and without loss of generality.

As a result of a standard application of Jensen's inequality to (31) we can
formulate similar results in terms of particular differences between clusters.

Corollary 2 The following relation is valid:

$$\Re^{(k-\tau+1)}_{\mathrm{emp}} - \Re^{(k)}_{\mathrm{emp}} \le \frac{\sum_{i=1}^{\tau} \sum_{c=1}^{\tau} p_i p_c KL(q_i, q_c)}{\sum_{c=1}^{\tau} p_c} \tag{32}$$

for any $n \ge k \ge 2$.

As a direct consequence of (32), we derive the formula for the case of two
clusters indexed by $i$ and $c$:

$$\Re^{(k-1)}_{\mathrm{emp}} - \Re^{(k)}_{\mathrm{emp}} \le \frac{p_i \cdot p_c \cdot KLS(q_i, q_c)}{p_i + p_c}. \tag{33}$$

The coefficient in (33) represents an increasing function of the probabilities
$p_i$ and $p_c \ge \alpha$. Respectively, we form the regularized empirical
risk by including an additional cost term in (3):

$$\Re^{(k)}_{\mathrm{emp}}[Q_n] + C(k) \tag{34}$$

where 

Cik) = (35) 

Minimizing the above regularized empirical risk as a function of the number of
clusters $k$, we make the required selection of the clustering size (see
Figure 1(d)).

Remark 5 Note a structural similarity between (34) and the Akaike Information
Criterion [1], [2], which has different grounds. In accordance with AIC, the
empirical log-likelihood is greater than the actual log-likelihood because
we use the same data in order to estimate the required parameters.
Asymptotically, the bias represents a linear function of the number of used
parameters.



6 Concluding Remarks 

Cluster analysis, an unsupervised learning method [TB], is widely used to study
the structure of the data when no specific response variable is specified.
Recently, several new clustering algorithms (e.g., graph-theoretical
clustering, model-based clustering) have been developed with the intention to
combine and improve the features of traditional clustering algorithms. However,
clustering algorithms are based on different assumptions, and the performance
of each clustering algorithm depends on properties of the input dataset.
Therefore, no single clustering algorithm wins on all datasets, and the
optimization of existing clustering algorithms is still a vibrant research
area [3].

Probabilistic space with KL-divergence represents an essentially different
case compared with Euclidean space with the standard squared metric. In this
paper we considered an illustration with a simple synthetic example. However,
many real-life datasets may be transferred into probabilistic space as a result
of proper normalisation. For example, we know that all elements of the colon
dataset are strictly positive. We can normalise any row of the colon matrix
(which has an interpretation as a gene) by division by the sum of the
corresponding elements. As a next step, we can apply the model of Section 3 in
order to reduce the dimensionality of the gene expression data. This analysis
has an important role to play in the discovery, validation and understanding of
various classes and subclasses of cancer [9].
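The row normalisation described above amounts to a few lines of code. The following is a minimal sketch with a hypothetical function name and toy numbers standing in for an expression matrix (they are not taken from the colon dataset); each strictly positive row is divided by its sum so that it becomes a probability vector in $\mathcal{P}^m$.

```python
def normalise_rows(matrix):
    """Divide each strictly positive row by its sum, giving probability vectors."""
    result = []
    for row in matrix:
        if any(x <= 0 for x in row):
            raise ValueError("all entries must be strictly positive")
        total = sum(row)
        result.append([x / total for x in row])
    return result

# Hypothetical toy values, one row per gene.
expression = [[12.0, 30.0, 18.0], [5.0, 5.0, 10.0]]
probabilities = normalise_rows(expression)
print(probabilities)
```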